read_csv('A,B
12:00, 12:00
14:30, midnight
20:01, noon')
# A tibble: 3 × 2
  A      B
  <time> <chr>
1 12:00  12:00
2 14:30  midnight
3 20:01  noon
Lecture 6:
Non-rectangular data
2024-10-24
- readr: read_csv() and read_delim() to parse CSVs
- tibble
- guess_parser()
- tidyverse()
- read.csv(): “invalid multibyte string at …”
- read_delim(): more robust to encoding issues!
- With iconv(), we found the right encoding for \xF6. We could then import the data using this encoding.
However, because of the special character, the column was read as a character. We replaced the value and coerced the whole column to numeric.
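The encoding step can be sketched in base R. This is a minimal illustration, assuming the byte \xF6 comes from a Latin-1 file (the recap does not name the encoding that was found); the variable names are hypothetical.

```r
# A single byte 0xF6, as it might appear in a file that is not UTF-8.
raw_text <- "\xF6"

# Assumption: the source encoding is Latin-1 (ISO-8859-1), where 0xF6 is "ö".
fixed <- iconv(raw_text, from = "latin1", to = "UTF-8")

# The lone byte is not valid UTF-8; the converted string is.
validUTF8(raw_text)
validUTF8(fixed)

# Unicode code point of the converted character (246 is U+00F6).
utf8ToInt(fixed)
```

With the encoding known, it can be passed to the import function so that the conversion happens while reading.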
Using:
“If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.”
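The precedence rule in the quote can be checked with a few lines of base R. This is a small sketch with hypothetical variable names, not the code from the original slide.

```r
# Combining types: everything is coerced to the highest-ranking type present.
x <- c(TRUE, 1L, 2.5)   # logical and integer become double
y <- c(TRUE, "a")       # logical becomes character

typeof(x)
y

# Comparisons follow the same rule.
cmp_chr <- TRUE == "TRUE"   # TRUE is coerced to the string "TRUE"
cmp_num <- TRUE == 1        # TRUE is coerced to the number 1
cmp_mix <- TRUE == "1"      # compares "TRUE" with "1"
```

The last comparison is FALSE because the logical value is converted to the string "TRUE", not to the number 1.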
This question checks your understanding of the coercion of boolean values.
What is the output of the following code?
- "TRUE" "FALSE" "TRUE" "FALSE"
- TRUE FALSE TRUE FALSE
- 1 0 1 0
- 0 1 0 1
Consider the data frame
Which of these statements are TRUE?
- dataCHframe$PartyCenter <- c(25, 20, 15) creates a new variable called “PartyCenter”
- dim(dataCHframe[, dataCHframe$PartyLeft > 40]) returns the same as dim(dataCHframe[, c(2,3)])
- dim(dataCHframe[dataCHframe$PartyLeft > 40 | dataCHframe$PartyLeft < 40, ]) returns c(3,3)
- dataCHframe is a data.frame, which is a list consisting of one named character vector and two named integer vectors.
You want to import a file using read_delim(). Describe what read_delim() does under the hood. What should be added to this command in order for it to work?
Consider the following code
Are these statements TRUE or FALSE?
- mean(df$a) == 2.5
- typeof(as.matrix(df)[,1]) is numeric (or double)
- as_tibble(df)[1:2, 1] contains the same information as df[1:2, 1] (FALSE, because tibbles do not simplify when subsetting. Not exam relevant.)
Today
- “Non-Rectangular Data in Economic Research” with Minna Heim
father  mother  name      age  gender
                John      33   male
                Julia     32   female
John    Julia   Jack      6    male
John    Julia   Jill      4    female
John    Julia   John jnr  2    male
                David     45   male
                Debbie    42   female
David   Debbie  Donald    16   male
David   Debbie  Dianne    12   female
<?xml version="1.0" encoding="UTF-8"?>
<data>
<row>
<unique_id>216498</unique_id>
<indicator_id>386</indicator_id>
<name>Ozone (O3)</name>
<measure>Mean</measure>
<measure_info>ppb</measure_info>
<geo_type_name>CD</geo_type_name>
<geo_join_id>313</geo_join_id>
<geo_place_name>Coney Island (CD13)</geo_place_name>
<time_period>Summer 2013</time_period>
<start_date>2013-06-01T00:00:00</start_date>
<data_value>34.64</data_value>
</row>
<row>
<unique_id>216499</unique_id>
<indicator_id>386</indicator_id>
...
</row>
</data>
An XML document begins with some information about XML itself. For example, it might mention the XML version that it follows. This opening is called an XML declaration.
The “row-content” is nested between the ‘row’-tags:
There are two principal ways to link variable names to values.
<?xml version="1.0" encoding="UTF-8"?>
<dataset>
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
</dataset>
Attribute-based: values are stored as attributes of a tag, as in the <case> elements above.
Tag-based: values are stored as the content of nested tags:
<cases>
<case>
<date>16-JAN-1994</date>
<temperature>9.200012</temperature>
</case>
<case>
<date>16-FEB-1994</date>
<temperature>10.70001</temperature>
</case>
<case>
<date>16-MAR-1994</date>
<temperature>7.5</temperature>
</case>
<case>
<date>16-APR-1994</date>
<temperature>8.100006</temperature>
</case>
</cases>
Potential drawback of XML: inefficient storage
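The storage overhead can be made concrete by writing the same four observations once as CSV and once as tag-based XML and counting characters. This is a rough base-R sketch with hypothetical variable names; the exact ratio depends on tag names and formatting.

```r
# The four temperature observations as CSV text.
csv_string <- paste(
  "date,temperature",
  "16-JAN-1994,9.200012",
  "16-FEB-1994,10.70001",
  "16-MAR-1994,7.5",
  "16-APR-1994,8.100006",
  sep = "\n"
)

# The same observations as tag-based XML, written as compactly as possible.
xml_string <- paste(
  "<cases>",
  "<case><date>16-JAN-1994</date><temperature>9.200012</temperature></case>",
  "<case><date>16-FEB-1994</date><temperature>10.70001</temperature></case>",
  "<case><date>16-MAR-1994</date><temperature>7.5</temperature></case>",
  "<case><date>16-APR-1994</date><temperature>8.100006</temperature></case>",
  "</cases>",
  sep = "\n"
)

# Characters needed per format, and the ratio between them.
n_csv <- nchar(csv_string)
n_xml <- nchar(xml_string)
n_xml / n_csv
```

Every value in the XML version carries its variable name twice (opening and closing tag), which is why the XML text is several times longer than the CSV text.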
<?xml version="1.0" encoding="UTF-8"?>
<company_dundermifflin>
<person id="1">
<name>Michael Scott</name>
<position>Regional Manager</position>
<location branch="Scranton"/>
</person>
<person id="2">
<name>Dwight Schrutte</name>
<position>Assistant (to the) Regional Manager</position>
<location branch="Scranton"/>
<orders>
<sales>
<units>10</units>
<product>paper A4</product>
</sales>
</orders>
</person>
<person id="3">
<name>Jim Halpert</name>
<position>Sales Representative</position>
<location branch="Scranton"/>
<orders>
<sales>
<units>20</units>
<product>paper A4</product>
</sales>
<sales>
<units>5</units>
<product>paper A3</product>
</sales>
</orders>
</person>
</company_dundermifflin>
{xml_document}
<company_dundermifflin>
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
[2] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[3] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
‘company_dundermifflin’ is the root node; the ‘person’ nodes are its children:
{xml_nodeset (3)}
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
[2] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[3] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
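Output like the above can be reproduced by walking the tree from the root. The following is a minimal sketch, assuming the xml2 package and a shortened, hypothetical version of the company document; the calls shown are not necessarily the ones used on the slides.

```r
library(xml2)

# Assumption: a shortened stand-in for the company document.
xml_doc <- read_xml('<company_dundermifflin>
  <person id="1"><name>Michael Scott</name></person>
  <person id="2"><name>Dwight Schrutte</name></person>
</company_dundermifflin>')

# Name of the root node.
root_name <- xml_name(xml_doc)

# The children of the root are the <person> nodes.
persons <- xml_children(xml_doc)
person_ids <- xml_attr(persons, "id")

# Moving down one more level, and back up again.
first_child <- xml_children(persons[[1]])[[1]]
child_name <- xml_name(first_child)
parent_name <- xml_name(xml_parent(persons[[1]]))
```

xml_children() moves one level down, xml_parent() one level up, and xml_attr() reads an attribute such as id from each node.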
{xml_node}
<name>
{xml_nodeset (1)}
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
{xml_nodeset (2)}
[1] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[2] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
{xml_nodeset (1)}
[1] <company_dundermifflin>\n <person id="1">\n <name>Michael Scott</name>\n <position>Reg ...
{xml_nodeset (1)}
[1] <company_dundermifflin>\n <person id="1">\n <name>Michael Scott</name>\n <position>Reg ...
{xml_nodeset (3)}
[1] <sales>\n <units>10</units>\n <product>paper A4</product>\n</sales>
[2] <sales>\n <units>20</units>\n <product>paper A4</product>\n</sales>
[3] <sales>\n <units>5</units>\n <product>paper A3</product>\n</sales>
[1] "10paper A4" "20paper A4" "5paper A3"
{xml_nodeset (3)}
[1] <position>Regional Manager</position>
[2] <position>Assistant (to the) Regional Manager</position>
[3] <position>Sales Representative</position>
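Node sets like these are typically selected with XPath expressions. The following is a minimal sketch, assuming the xml2 package and a shortened, hypothetical version of the company document; the expressions are illustrations and not necessarily those used on the slides.

```r
library(xml2)

# Assumption: a shortened stand-in for the company document.
xml_doc <- read_xml('<company_dundermifflin>
  <person id="2">
    <name>Dwight Schrutte</name>
    <orders><sales><units>10</units><product>paper A4</product></sales></orders>
  </person>
  <person id="3">
    <name>Jim Halpert</name>
    <orders>
      <sales><units>20</units><product>paper A4</product></sales>
      <sales><units>5</units><product>paper A3</product></sales>
    </orders>
  </person>
</company_dundermifflin>')

# "//" searches the whole document, regardless of depth.
all_units <- xml_text(xml_find_all(xml_doc, "//units"))

# A predicate in square brackets filters nodes, here by attribute value.
jim <- xml_find_first(xml_doc, "//person[@id='3']")

# "./" and ".//" search relative to the current node.
jim_name <- xml_text(xml_find_first(jim, "./name"))
jim_products <- xml_text(xml_find_all(jim, ".//product"))
```

Searching relative to a node (with a leading dot) keeps the results tied to one person, which is what makes it possible to build one row per sale later on.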
library(xml2) # xml_doc is the parsed company document from above

# Extract all <person> nodes with sales information.
# [orders] is a predicate that keeps only the <person> elements with at least one <orders> child node.
persons_with_sales <- xml_find_all(xml_doc, "//person[orders]")
# Initialize an empty data frame to store sales data (no placeholder NA row)
sales_df_final <- data.frame()
for (i in seq_along(persons_with_sales)) { # Loop over every person with sales.
  person <- persons_with_sales[[i]]
  # Extract person's name
  name <- xml_text(xml_find_first(person, "./name"))
  # Extract sales nodes
  sales <- xml_find_all(person, ".//sales")
  units <- xml_text(xml_find_all(sales, "./units"))
  product <- xml_text(xml_find_all(sales, "./product"))
  # Create a data frame for the person's sales
  sales_person <- data.frame(
    name = name,
    units = as.numeric(units),
    product = product
  )
  sales_df_final <- rbind(sales_df_final, sales_person)
}
sales_df_final
             name units  product
1 Dwight Schrutte    10 paper A4
2     Jim Halpert    20 paper A4
3     Jim Halpert     5 paper A3
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}
List of 6
$ firstName : chr "John"
$ lastName : chr "Smith"
$ age : int 25
$ address :List of 4
..$ streetAddress: chr "21 2nd Street"
..$ city : chr "New York"
..$ state : chr "NY"
..$ postalCode : chr "10021"
$ phoneNumber:'data.frame': 2 obs. of 2 variables:
..$ type : chr [1:2] "home" "fax"
..$ number: chr [1:2] "212 555-1234" "646 555-4567"
$ gender :List of 1
..$ type: chr "male"
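A structure like the one printed above is what a JSON parser returns in R. The following is a minimal sketch, assuming the jsonlite package and a shortened version of the JSON document; the default simplification of fromJSON() would explain why phoneNumber appears as a data.frame.

```r
library(jsonlite)

# Assumption: a shortened version of the JSON document above.
person <- fromJSON('{
  "firstName": "John",
  "age": 25,
  "address": {"city": "New York", "state": "NY"},
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax",  "number": "646 555-4567"}
  ]
}')

# JSON objects become named lists, so nested values are reached with $.
city <- person$address$city

# An array of objects with the same keys is simplified to a data frame.
phone_types <- person$phoneNumber$type

# Inspect the full nesting structure.
str(person)
```

Objects map to lists and arrays of similar objects map to data frames, so each level of nesting in the JSON corresponds to one more `$` in R.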
The nesting structure is represented as a nested list: